On the utility and protection of optimization with differential privacy and classic regularization techniques
Owners and developers of deep learning models must nowadays comply with
stringent privacy-preservation rules for their training data, which is often
crowd-sourced and may retain sensitive information. The most widely adopted
method to enforce privacy guarantees for a deep learning model relies on
optimization techniques that enforce differential privacy. According to the
literature, this approach has proven a successful defence against several
privacy attacks on models, but its downside is a substantial degradation of the
models' performance. In this work, we compare the effectiveness of the
differentially private stochastic gradient descent (DP-SGD) algorithm against
standard optimization practices with regularization techniques. We analyze the
resulting models' utility, training performance, and the effectiveness of
membership inference and model inversion attacks against the learned models.
Finally, we discuss the flaws and limits of differential privacy and
empirically demonstrate the often superior privacy-preserving properties of
dropout and L2 regularization.
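To make the comparison concrete, a minimal sketch of a single DP-SGD update follows: each per-sample gradient is clipped to a fixed norm, the clipped gradients are averaged, and Gaussian noise calibrated to the clipping bound is added before the descent step. This is an illustrative toy implementation, not the paper's code; the function name and hyperparameter values are assumptions.

```python
import numpy as np

def dp_sgd_update(params, per_sample_grads, clip_norm=1.0,
                  noise_multiplier=1.1, lr=0.05, rng=None):
    # 1) clip each per-sample gradient to norm <= clip_norm
    # 2) average the clipped gradients
    # 3) add Gaussian noise scaled to the clipping bound
    # 4) take a gradient step
    rng = np.random.default_rng() if rng is None else rng
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_sample_grads]
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=params.shape)
    return params - lr * (np.mean(clipped, axis=0) + noise)

# toy example: two per-sample gradients, one exceeding the clipping bound
params = np.zeros(3)
grads = [np.array([3.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.0])]
params = dp_sgd_update(params, grads, rng=np.random.default_rng(0))
```

The clipping bounds each sample's influence on the update, which is what lets the added noise translate into a formal differential-privacy guarantee; in contrast, dropout and L2 regularization constrain the model without any per-sample accounting.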
Two Steps Forward and One Behind: Rethinking Time Series Forecasting with Deep Learning
The Transformer is a highly successful deep learning model that has
revolutionised the world of artificial neural networks, first in natural
language processing and later in computer vision. This model is based on the
attention mechanism and is able to capture complex semantic relationships
between a variety of patterns present in the input data. Precisely because of
these characteristics, the Transformer has recently been applied to time
series forecasting problems, on the assumption that it naturally adapts to the
domain of continuous numerical series. Despite the acclaimed results in the literature,
some works have raised doubts about the robustness of this approach. In this
paper, we further investigate the effectiveness of Transformer-based models
applied to the domain of time series forecasting, demonstrate their
limitations, and propose a set of alternative models that are better performing
and significantly less complex. In particular, we empirically show how
simplifying this forecasting model almost always leads to an improvement,
reaching the state of the art among Transformer-based architectures. We also
propose shallow models without the attention mechanism, which compete with the
overall state of the art in long time series forecasting, and demonstrate their
ability to accurately predict extremely long windows. We show that it is always
necessary to use a simple baseline to verify the effectiveness of one's models,
and we conclude the paper with a reflection on recent research paths and the
tendency to follow trends and apply the latest model even where it may not be
necessary.
Heterogeneous Datasets for Federated Survival Analysis Simulation
This repo contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:
uniform_split: assigns each sample to a random client with uniform probability;
quantity_skewed_split: assigns each sample to a random client according to the Dirichlet distribution [3, 4];
label_skewed_split: assigns each sample to a time bin, then assigns a set of samples from each bin to the clients according to the Dirichlet distribution [3, 4].
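The Dirichlet-based idea behind the quantity-skewed algorithm can be sketched as follows: draw per-client shares from a symmetric Dirichlet distribution, then route each sample to a client according to those shares. This is a minimal illustrative sketch, not the repository's implementation; the function name, `alpha` default, and seeding are assumptions (see the linked paper and repo for the actual code).

```python
import numpy as np

def dirichlet_quantity_split(num_clients, X, y, alpha=0.5, rng=None):
    # draw client shares from Dirichlet(alpha, ..., alpha), then assign each
    # sample to a client according to those shares (quantity skew):
    # smaller alpha -> more unbalanced client sizes
    rng = np.random.default_rng() if rng is None else rng
    shares = rng.dirichlet(alpha * np.ones(num_clients))
    assignments = rng.choice(num_clients, size=len(X), p=shares)
    return [(X[assignments == c], y[assignments == c])
            for c in range(num_clients)]

# toy example: 20 samples with 2 features, split across 4 clients
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)
clients = dirichlet_quantity_split(4, X, y, rng=np.random.default_rng(0))
```

The label-skewed variant applies the same Dirichlet draw per time bin rather than globally, so that clients differ in their event-time distributions as well as in size.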
For more information, please take a look at our paper at https://arxiv.org/abs/2301.12166 [1].
Content
federated_survival_datasets.zip: the content of the repository at https://github.com/archettialberto/federated_survival_datasets
Heterogheneous_Datasets_for_Federated_Survival_Analysis_Simulation.pdf: the conference paper describing the work.
Installation
Federated Survival Datasets is built on top of numpy and scikit-learn. To install those libraries you can run pip install -r requirements.txt. To import survival datasets into your project, we strongly recommend SurvSet (https://github.com/ErikinBC/SurvSet) [2], a comprehensive collection of more than 70 survival datasets.
Usage
import numpy as np
import pandas as pd
from federated_survival_datasets import label_skewed_split
# import a survival dataset and extract the input array X and the output array y
df = pd.read_csv("metabric.csv")
X = df[[f"x{i}" for i in range(9)]].to_numpy()
y = np.array([(e, t) for e, t in zip(df["event"], df["time"])], dtype=[("event", bool), ("time", float)])
# run the splitting algorithm
client_data = label_skewed_split(num_clients=8, X=X, y=y)
# check the number of samples assigned to each client
for i, (X_c, y_c) in enumerate(client_data):
    print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")
We provide an example notebook in the zipped folder to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.
References
[1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.
[2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.
[3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
[4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-IID data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE.
POPNASv2: An Efficient Multi-Objective Neural Architecture Search Technique
Automating the search for the best neural network model is a task that has gained more and more relevance in the last few years. In this context, Neural Architecture Search (NAS) represents the most effective technique, whose results rival state-of-the-art hand-crafted architectures.
However, this approach requires substantial computational resources and search time, which makes its usage prohibitive in many real-world scenarios.
With its sequential model-based optimization strategy, Progressive Neural Architecture Search (PNAS) represents a possible step forward in addressing this resource issue. Despite the quality of the network architectures it finds, this technique is still limited by its search time.
A significant step in this direction has been taken by Pareto-Optimal Progressive Neural Architecture Search (POPNAS), which expands PNAS with a time predictor to enable a trade-off between search time and accuracy, framed as a multi-objective optimization problem.
This paper proposes a new version of the Pareto-Optimal Progressive Neural Architecture Search, called POPNASv2.
Our approach enhances its first version and improves its performance.
We expanded the search space by adding new operators and improved the quality of both predictors to build more accurate Pareto fronts.
Moreover, we introduced cell equivalence checks and enriched the search strategy with an adaptive greedy exploration step.
Our efforts allow POPNASv2 to achieve PNAS-like performance with an average 4x factor search time speed-up.
The official version of this tool is available at AndreaFalanti/popnas-v2 (github.com).
SGDE: Secure Generative Data Exchange for Cross-Silo Federated Learning
Privacy regulation laws, such as GDPR, impose transparency and security as
design pillars for data processing algorithms. In this context, federated
learning is one of the most influential frameworks for privacy-preserving
distributed machine learning, achieving astounding results in many natural
language processing and computer vision tasks. Several federated learning
frameworks employ differential privacy to prevent private data leakage to
unauthorized parties and malicious attackers. Many studies, however, highlight
the vulnerabilities of standard federated learning to poisoning and inference,
thus raising concerns about potential risks for sensitive data. To address this
issue, we present SGDE, a generative data exchange protocol that improves user
security and machine learning performance in a cross-silo federation. The core
of SGDE is to share data generators with strong differential privacy guarantees
trained on private data instead of communicating explicit gradient information.
These generators synthesize an arbitrarily large amount of data that retain the
distinctive features of private samples but differ substantially. In this work,
SGDE is tested in a cross-silo federated network on images and tabular
datasets, exploiting beta-variational autoencoders as data generators. Our
results show that SGDE improves task accuracy and fairness, as well as
resilience to the most influential attacks on federated learning.
Identification and characterization of learning weakness from drawing analysis at the pre-literacy stage
Handwriting learning delays should be addressed early to prevent their exacerbation and long-lasting consequences on children's whole lives. Ideally, proper training should start even before learning how to write. This work presents a novel method to disclose potential handwriting problems from a pre-literacy stage, based on the analysis of drawings instead of word production. Two hundred forty-one kindergartners drew on a tablet, and we computed features known to be distinctive of poor handwriting from their symbol drawings. We verified that abnormal feature patterns reflected abnormal drawings and found correspondence with experts' evaluation of the potential risk of developing a learning delay in the graphical sphere. A machine learning model was able to discriminate children at risk with 0.75 sensitivity and 0.76 specificity. Finally, we explain why the algorithm considered children at risk, to inform teachers of the specific weaknesses that need training. Thanks to this system, early intervention targeting specific learning delays will finally be possible.